In previous lessons, we focused on elementwise operations (like a basic ReLU on a matrix). These are memory-bound because the GPU spends more time moving data from HBM to registers than performing math.
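To see why elementwise ops are memory-bound, we can count FLOPs per byte moved. The sketch below is illustrative (the helper name `relu_arithmetic_intensity` is ours, not a standard API): ReLU does roughly one op per element but must read and write every element.

```python
def relu_arithmetic_intensity(n: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per byte for elementwise ReLU on an n-element fp32 tensor:
    ~1 op per element, one read from HBM and one write back."""
    flops = n                                  # one max(x, 0) per element
    bytes_moved = 2 * n * bytes_per_elem       # read input, write output
    return flops / bytes_moved

# The ratio is a constant 1/8 FLOP per byte, no matter how large the
# tensor gets -- the memory system, not the ALUs, sets the speed limit.
print(relu_arithmetic_intensity(10**6))
```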
1. Why GEMM is Central
General Matrix Multiplication (GEMM) performs $O(N^3)$ floating-point operations on only $O(N^2)$ data, so its arithmetic intensity grows linearly with $N$. This lets us hide memory latency behind massive arithmetic throughput, making GEMM the "heartbeat" of LLMs.
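The contrast with the elementwise case can be made concrete. A minimal sketch, assuming the ideal case where each matrix crosses the memory bus once (the helper name is illustrative):

```python
def gemm_arithmetic_intensity(n: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per byte for a square N x N x N matmul C = A @ B, assuming
    each of A, B, C crosses the HBM bus exactly once (perfect reuse)."""
    flops = 2 * n**3                            # N multiply-adds per output element
    bytes_moved = 3 * n**2 * bytes_per_elem     # read A, read B, write C
    return flops / bytes_moved

# Intensity is N/6 FLOPs per byte: doubling N doubles the work done per
# byte moved, which is why large GEMMs become compute-bound.
print(gemm_arithmetic_intensity(1024), gemm_arithmetic_intensity(2048))
```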
2. 2D Memory Representation
Physical RAM is 1D. To represent a 2D tensor, we use strides: the number of elements (or bytes) to skip in the flat buffer to advance one step along each dimension. A common production pitfall is assuming a tensor is contiguous. If you mix up row and column strides in your pointer math, you will read "ghost" data or trigger memory violations.
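NumPy exposes the same stride machinery Triton kernels rely on, so we can demonstrate both the pointer math and the non-contiguity pitfall without a GPU:

```python
import numpy as np

# Row-major (C-contiguous) 3x4 fp32 matrix: element (i, j) lives at
# flat offset i * 4 + j. NumPy reports strides in bytes; divide by the
# element size to get element strides.
a = np.arange(12, dtype=np.float32).reshape(3, 4)
row_stride = a.strides[0] // a.itemsize   # 4 elements to step one row down
col_stride = a.strides[1] // a.itemsize   # 1 element to step one column right

# Manual pointer arithmetic against the flat 1D buffer:
flat = a.ravel()
i, j = 2, 3
assert flat[i * row_stride + j * col_stride] == a[i, j]

# Transposing swaps the strides WITHOUT copying data. The result is no
# longer contiguous, so any kernel that hard-codes stride (cols, 1)
# would silently read the wrong ("ghost") elements.
t = a.T
assert t.strides == (a.strides[1], a.strides[0])
assert not t.flags["C_CONTIGUOUS"]
```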
3. Tiled Generalization
Triton generalizes elementwise logic by shifting from single pointers to blocks of pointers. By using 2D tiles (e.g., $16 \times 16$), we exploit data reuse in high-speed SRAM, keeping data "hot" for fused operations like bias addition or activations before writing back to global memory.
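The tiling pattern can be sketched in plain NumPy, with a small accumulator standing in for on-chip SRAM. This is a conceptual model of the loop structure, not a Triton kernel; the function name and tile size are illustrative:

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 16) -> np.ndarray:
    """Block-tiled matmul. Each (tile x tile) output block is accumulated
    in a small buffer (the stand-in for SRAM) across the K dimension,
    then written back to 'global memory' exactly once.
    Assumes all dimensions are divisible by `tile` for simplicity."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and N % tile == 0 and K % tile == 0
    C = np.zeros((M, N), dtype=np.float32)
    for m in range(0, M, tile):
        for n in range(0, N, tile):
            acc = np.zeros((tile, tile), dtype=np.float32)  # "on-chip" accumulator
            for k in range(0, K, tile):
                # Each A and B tile is reused across `tile` outputs once loaded.
                acc += A[m:m+tile, k:k+tile] @ B[k:k+tile, n:n+tile]
            # Fusion point: a bias add or activation would apply to `acc`
            # here, while it is still "hot", before the single write-back.
            C[m:m+tile, n:n+tile] = acc
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 48)).astype(np.float32)
B = rng.standard_normal((48, 64)).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```

The key design point mirrors the Triton mental model: the two outer loops pick a block of pointers (one output tile), and the inner loop streams tiles through fast local storage so each loaded element is reused `tile` times.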